test collection
Towards Understanding Bias in Synthetic Data for Evaluation
Rahmani, Hossein A., Ramineni, Varsha, Yilmaz, Emine, Craswell, Nick, Mitra, Bhaskar
Test collections are crucial for evaluating Information Retrieval (IR) systems. Creating a diverse set of user queries for these collections can be challenging, and obtaining relevance judgments, which indicate how well retrieved documents match a query, is often costly and resource-intensive. Recently, generating synthetic datasets using Large Language Models (LLMs) has gained attention in various applications. While previous work has used LLMs to generate synthetic queries or documents to improve ranking models, using LLMs to create synthetic test collections is still relatively unexplored. Previous work [Rahmani et al., 2024] showed that synthetic test collections have the potential to be used for system evaluation; however, more analysis is needed to validate this claim. In this paper, we thoroughly investigate the reliability of synthetic test collections constructed using LLMs, where LLMs are used to generate synthetic queries, labels, or both. In particular, we examine the potential biases that might occur when such test collections are used for evaluation. We first empirically show the presence of such bias in evaluation results and analyse the effects it might have on system evaluation. We further validate the presence of such bias using a linear mixed-effects model. Our analysis shows that while the bias present in evaluation results obtained using synthetic test collections can be significant, e.g. when computing absolute system performance, its effect may not be as significant when comparing relative system performance. Code and data are available at: https://github.com/rahmanidashti/BiasSyntheticData.
- Asia > South Korea > Seoul > Seoul (0.05)
- North America > United States > New York > New York County > New York City (0.05)
- Europe > United Kingdom > England > Greater London > London (0.05)
- (3 more...)
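The abstract above validates the presence of bias with a linear mixed-effects model. Below is a minimal sketch of what such an analysis could look like in Python with statsmodels; the column names, the injected bias, and the toy data are assumptions for illustration, not the paper's actual variables or results.

```python
# Hedged sketch: test whether the label source (human vs. synthetic) shifts
# per-query system scores, with a random intercept per query. All data here
# is synthetic toy data; column names are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
systems = [f"run_{i}" for i in range(4)]
queries = [f"q{i}" for i in range(20)]

rows = []
for src, shift in [("human", 0.0), ("synthetic", 0.03)]:  # small injected bias
    for s in systems:
        for q in queries:
            rows.append({"system": s, "query": q, "label_source": src,
                         "ndcg": rng.normal(0.55 + shift, 0.05)})
df = pd.DataFrame(rows)

# Fixed effects for label source and system, random intercept grouped by query;
# a significant label_source coefficient would indicate a systematic bias.
model = smf.mixedlm("ndcg ~ label_source + system", data=df, groups=df["query"])
print(model.fit().summary())
```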
Fashion-AlterEval: A Dataset for Improved Evaluation of Conversational Recommendation Systems with Alternative Relevant Items
In Conversational Recommendation Systems (CRS), a user provides feedback on recommended items at each turn, leading the CRS towards improved recommendations. Due to the need for a large amount of data, a user simulator is employed for both training and evaluation. Such user simulators critique the currently retrieved item based on knowledge of a single target item. However, system evaluation in offline settings with simulators is limited by this focus on a single target item and by the simulators' unlimited patience over a large number of turns. To overcome these limitations of existing simulators, we propose Fashion-AlterEval, a new dataset that contains human judgments for a selection of alternative items, obtained by adding new annotations to common fashion CRS datasets. We then propose two novel meta-user simulators that use the collected judgments and allow simulated users not only to express their preferences about alternative items to their original target, but also to change their mind and their level of patience. In our experiments using Shoes and Fashion IQ as the original datasets and three CRS models, we find that the simulator's knowledge of alternatives can have a considerable impact on the evaluation of existing CRS models: the existing single-target evaluation underestimates their effectiveness, and when simulated users are allowed to consider alternative relevant items, the system can satisfy the user more quickly.
- Europe > Czechia > Prague (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
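The abstract above describes meta-user simulators that accept alternative relevant items and have a limited patience budget. A minimal, hypothetical sketch of that simulator logic follows; the item identifiers, the relevance store, and the patience value are assumptions, not the paper's implementation.

```python
# Hedged sketch: a toy simulated user that accepts the original target or any
# judged alternative, and abandons the session once its patience runs out.
from dataclasses import dataclass, field

@dataclass
class MetaUserSimulator:
    target: str
    alternatives: set[str]           # human-judged alternative relevant items
    patience: int = 5                # max turns before the user gives up
    turns_taken: int = field(default=0, init=False)

    def respond(self, recommended_item: str) -> str:
        self.turns_taken += 1
        if recommended_item == self.target or recommended_item in self.alternatives:
            return "accept"
        if self.turns_taken >= self.patience:
            return "quit"
        return "critique"  # in a real CRS this would be textual feedback

# Usage sketch: the CRS reaches an alternative relevant item on the second turn.
user = MetaUserSimulator(target="dress_123", alternatives={"dress_456", "dress_789"})
for item in ["dress_001", "dress_456"]:
    print(item, "->", user.respond(item))
```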
Overview of the TREC 2022 deep learning track
Craswell, Nick, Mitra, Bhaskar, Yilmaz, Emine, Campos, Daniel, Lin, Jimmy, Voorhees, Ellen M., Soboroff, Ian
At TREC 2022, we hosted the fourth TREC Deep Learning Track, continuing our focus on benchmarking ad hoc retrieval methods in the large-data regime. As in previous years [Craswell et al., 2020, 2021a, 2022], we leverage the MS MARCO datasets [Bajaj et al., 2016], which made hundreds of thousands of human-annotated training labels available for both passage and document ranking tasks. In addition, last year we refreshed both the passage and the document collections, which led to a nearly 16-fold increase in the size of the passage collection and a nearly four-fold increase in the document collection size. Besides evaluating ranking methods on the larger collections, the data refresh also aimed at providing additional metadata (e.g., passage-to-document mappings) that may be useful for ranking, as well as incorporating some fixes for known text encoding issues in previous versions of the datasets. This year we continue to benchmark against these larger passage and document collections. However, the significant increase in collection sizes last year led to a corresponding increase in the number of relevant results in the collection per query, and the existing judgment budget was exceeded before a reasonably complete set of these relevant results could be identified by the NIST judges. This large number of relevant results raised serious concerns about the test collection generated by last year's track, relating to reusability and also score saturation [Voorhees et al., 2022, Craswell et al., 2022]. To address these concerns, we made three changes this year with the goal of reducing the number of relevant results per query, and the judgment costs in general, so that they may be reused to obtain a more complete set of judgments and consequently a more reusable test collection: (1) We used test queries that did not contribute to the MS MARCO corpus.
- Europe > Austria > Vienna (0.05)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
Overview of the TREC 2023 deep learning track
Craswell, Nick, Mitra, Bhaskar, Yilmaz, Emine, Rahmani, Hossein A., Campos, Daniel, Lin, Jimmy, Voorhees, Ellen M., Soboroff, Ian
This is the fifth year of the TREC Deep Learning track. As in previous years, we leverage the MS MARCO datasets, which made hundreds of thousands of human-annotated training labels available for both passage and document ranking tasks. We mostly repeated last year's design to get another matching test set, based on the larger, cleaner, less-biased v2 passage and document sets, with passage ranking as the primary task and document ranking as a secondary task (using labels inferred from passages). As we did last year, we sample from MS MARCO queries that were completely held out and unused in corpus construction, unlike the test queries in the first three years. This approach yields a more difficult test with more headroom for improvement. Alongside the usual human MS MARCO queries, this year we generated synthetic queries using a fine-tuned T5 model and using a GPT-4 prompt. The new headline result this year is that runs using Large Language Model (LLM) prompting in some way outperformed runs that use the "nnlm" approach, which was the best approach in the previous four years. Since this is the last year of the track, future iterations of prompt-based ranking can happen in other tracks. Human relevance assessments were applied to all query types, not just the human MS MARCO queries. Evaluation using synthetic queries gave similar results to human queries, with a system ordering agreement of $\tau = 0.8487$. However, human effort was needed to select a subset of the synthetic queries that were usable. We did not see clear evidence of bias, where runs using GPT-4 would be favored when evaluated using synthetic GPT-4 queries, or runs using T5 would be favored when evaluated on synthetic T5 queries.
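The system ordering agreement of $\tau = 0.8487$ reported above is a Kendall rank correlation between per-run scores under the two query sources. A minimal sketch of how such an agreement figure is computed follows; the run names and scores below are made up for illustration.

```python
# Hedged sketch: Kendall's tau over per-system scores from two evaluation
# settings. The runs and their mean NDCG values are illustrative assumptions.
from scipy.stats import kendalltau

human_eval     = {"run_a": 0.62, "run_b": 0.58, "run_c": 0.55, "run_d": 0.41}
synthetic_eval = {"run_a": 0.64, "run_b": 0.54, "run_c": 0.56, "run_d": 0.43}

runs = sorted(human_eval)
tau, p_value = kendalltau([human_eval[r] for r in runs],
                          [synthetic_eval[r] for r in runs])
print(f"system ordering agreement: tau={tau:.4f} (p={p_value:.3f})")
```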
Resource for Error Analysis in Text Simplification: New Taxonomy and Test Collection
Vendeville, Benjamin, Ermakova, Liana, De Loor, Pierre
The general public often encounters complex texts but does not have the time or expertise to fully understand them, leading to the spread of misinformation. Automatic Text Simplification (ATS) helps make information more accessible, but its evaluation methods have not kept up with advances in text generation, especially with Large Language Models (LLMs). In particular, recent studies have shown that current ATS metrics do not correlate with the presence of errors. Manual inspections have further revealed a variety of errors, underscoring the need for a more nuanced evaluation framework, which is currently lacking. This resource paper addresses this gap by introducing a test collection for detecting and classifying errors in simplified texts. First, we propose a taxonomy of errors, with a formal focus on information distortion. Next, we introduce a parallel dataset of automatically simplified scientific texts. This dataset has been human-annotated with labels based on our proposed taxonomy. Finally, we analyze the quality of the dataset, and we study the performance of existing models at detecting and classifying errors from that taxonomy. These contributions give researchers the tools to better evaluate errors in ATS, develop more reliable models, and ultimately improve the quality of automatically simplified texts.
- Europe > Italy (0.06)
- Europe > France > Brittany > Finistère > Brest (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- (7 more...)
- Health & Medicine (0.95)
- Education (0.68)
Synthetic Test Collections for Retrieval Evaluation
Rahmani, Hossein A., Craswell, Nick, Yilmaz, Emine, Mitra, Bhaskar, Campos, Daniel
Test collections play a vital role in evaluation of information retrieval (IR) systems. Obtaining a diverse set of user queries for test collection construction can be challenging, and acquiring relevance judgments, which indicate the appropriateness of retrieved documents to a query, is often costly and resource-intensive. Generating synthetic datasets using Large Language Models (LLMs) has recently gained significant attention in various applications. In IR, while previous work exploited the capabilities of LLMs to generate synthetic queries or documents to augment training data and improve the performance of ranking models, using LLMs for constructing synthetic test collections is relatively unexplored. Previous studies demonstrate that LLMs have the potential to generate synthetic relevance judgments for use in the evaluation of IR systems. In this paper, we comprehensively investigate whether it is possible to use LLMs to construct fully synthetic test collections by generating not only synthetic judgments but also synthetic queries. In particular, we analyse whether it is possible to construct reliable synthetic test collections and the potential risks of bias such test collections may exhibit towards LLM-based models. Our experiments indicate that it is possible to use LLMs to construct synthetic test collections that can reliably be used for retrieval evaluation.
- North America > United States > District of Columbia > Washington (0.05)
- Europe > United Kingdom > England > Greater London > London (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > Canada > Quebec > Montreal (0.04)
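The abstract above covers generating synthetic relevance judgments with an LLM. A minimal sketch of the judgment step follows; the prompt wording, the 0-3 grade scale, and the placeholder call_llm function are assumptions standing in for whatever model and prompt a collection builder would actually use.

```python
# Hedged sketch: graded relevance labelling of (query, passage) pairs via an LLM.
# call_llm is a hypothetical placeholder, not a specific API.
def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM call (e.g., via an API client)."""
    raise NotImplementedError

JUDGMENT_PROMPT = (
    "Given a query and a passage, output a relevance grade from 0 (irrelevant) "
    "to 3 (perfectly relevant). Output only the number.\n"
    "Query: {query}\nPassage: {passage}\nGrade:"
)

def synthetic_judgment(query: str, passage: str) -> int:
    """Ask the LLM for a graded label; clamp to the scale, fall back to 0 if unparseable."""
    reply = call_llm(JUDGMENT_PROMPT.format(query=query, passage=passage))
    try:
        return max(0, min(3, int(reply.strip())))
    except ValueError:
        return 0
```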
Can We Use Large Language Models to Fill Relevance Judgment Holes?
Abbasiantaeb, Zahra, Meng, Chuan, Azzopardi, Leif, Aliannejadi, Mohammad
Incomplete relevance judgments limit the reusability of test collections. When new systems are compared against the previous systems used to build the pool of judged documents, they often do so at a disadvantage due to the "holes" in the test collection (i.e., pockets of unassessed documents returned by the new system). In this paper, we take initial steps towards extending existing test collections by employing Large Language Models (LLMs) to fill the holes, leveraging and grounding the method using existing human judgments. We explore this problem in the context of Conversational Search using TREC iKAT, where information needs are highly dynamic and the responses (and the results retrieved) are much more varied, leaving bigger holes. While previous work has shown that automatic judgments from LLMs result in highly correlated rankings, we find substantially lower correlations when human plus automatic judgments are used (regardless of LLM, one/two/few-shot, or fine-tuned). We further find that, depending on the LLM employed, new runs will be highly favored (or penalized), and this effect is magnified in proportion to the size of the holes. Instead, one should generate the LLM annotations on the whole document pool to achieve rankings more consistent with human-generated labels. Future work is required on prompt engineering and fine-tuning LLMs to reflect and represent the human annotations, in order to ground and align the models such that they are more fit for purpose.
- Europe > Netherlands > North Holland > Amsterdam (0.05)
- North America > United States > New York > New York County > New York City (0.05)
- South America > Brazil > Bahia > Salvador (0.04)
- (8 more...)
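The abstract above centres on "holes", i.e., top-ranked documents from a new run that lack human judgments. Below is a minimal sketch of locating those holes before sending them (or, per the paper's finding, the whole pool) to an LLM assessor; the data shapes follow common TREC qrels and run conventions, while the toy data and depth cutoff are assumptions.

```python
# Hedged sketch: find unjudged documents ("holes") in the top ranks of a new run.
# qrels: {qid: {docid: grade}}, run: {qid: [docid, ...]} in rank order.
def find_holes(qrels: dict, run: dict, depth: int = 100) -> dict:
    """Per query, return the top-`depth` docs in the run with no human label."""
    holes = {}
    for qid, ranked_docs in run.items():
        judged = qrels.get(qid, {})
        holes[qid] = [d for d in ranked_docs[:depth] if d not in judged]
    return holes

# Toy data for illustration only.
qrels = {"q1": {"d3": 2, "d4": 0}, "q2": {"d2": 1}}
run = {"q1": ["d3", "d7", "d9"], "q2": ["d2", "d5"]}
print(find_holes(qrels, run, depth=10))   # -> {'q1': ['d7', 'd9'], 'q2': ['d5']}
```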
CAUSE: Counterfactual Assessment of User Satisfaction Estimation in Task-Oriented Dialogue Systems
Abolghasemi, Amin, Ren, Zhaochun, Askari, Arian, Aliannejadi, Mohammad, de Rijke, Maarten, Verberne, Suzan
An important unexplored aspect in previous work on user satisfaction estimation for Task-Oriented Dialogue (TOD) systems is the evaluation of their robustness in identifying user dissatisfaction: current benchmarks for user satisfaction estimation in TOD systems are highly skewed towards dialogues for which the user is satisfied. The effect of having a more balanced set of satisfaction labels on performance is unknown. However, balancing the data with more dissatisfactory dialogue samples requires further data collection and human annotation, which is costly and time-consuming. In this work, we leverage large language models (LLMs) and unlock their ability to generate satisfaction-aware counterfactual dialogues to augment the set of original dialogues of a test collection. We gather human annotations to ensure the reliability of the generated samples. We evaluate two open-source LLMs as user satisfaction estimators on our augmented collection against state-of-the-art fine-tuned models. Our experiments show that when used as few-shot user satisfaction estimators, open-source LLMs show higher robustness to the increase in the number of dissatisfaction labels in the test collection than the fine-tuned state-of-the-art models. Our results shed light on the need for data augmentation approaches for user satisfaction estimation in TOD systems. We release our aligned counterfactual dialogues, which are curated by human annotation, to facilitate further research on this topic.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > Ontario > Toronto (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (5 more...)
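The abstract above evaluates open-source LLMs as few-shot user satisfaction estimators. A minimal sketch of how such a few-shot prompt could be assembled follows; the demonstrations, the label set, and the placeholder call_llm function are illustrative assumptions, not the paper's actual prompt or models.

```python
# Hedged sketch: few-shot prompting for binary user satisfaction estimation.
# Demonstrations and labels are toy examples; call_llm is a hypothetical placeholder.
FEW_SHOT_EXAMPLES = [
    ("User: I need a table for four tonight. System: Booked at 7pm, reference A1B2.",
     "satisfied"),
    ("User: That's not what I asked for, cancel it. System: I have booked another hotel.",
     "dissatisfied"),
]

def build_prompt(dialogue: str) -> str:
    """Concatenate labelled demonstrations with the target dialogue."""
    parts = ["Label each dialogue as 'satisfied' or 'dissatisfied'.\n"]
    for demo, label in FEW_SHOT_EXAMPLES:
        parts.append(f"Dialogue: {demo}\nLabel: {label}\n")
    parts.append(f"Dialogue: {dialogue}\nLabel:")
    return "\n".join(parts)

def call_llm(prompt: str) -> str:
    """Placeholder for whichever open-source LLM is being evaluated."""
    raise NotImplementedError

def estimate_satisfaction(dialogue: str) -> str:
    reply = call_llm(build_prompt(dialogue)).strip().lower()
    return "dissatisfied" if "dissatisfied" in reply else "satisfied"
```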
Testing different Log Bases For Vector Model Weighting Technique
Information retrieval systems retrieve relevant documents based on a query submitted by the user. The documents are initially indexed, and the words in the documents are assigned weights using a weighting technique called TF-IDF, which is the product of Term Frequency (TF) and Inverse Document Frequency (IDF). TF represents the number of occurrences of a term in a document. IDF measures whether the term is common or rare across all documents. It is computed by dividing the total number of documents in the system by the number of documents containing the term and then computing the logarithm of the quotient. By default, base 10 is used to calculate the logarithm. In this paper, we test this weighting technique by using a range of log bases from 0.1 to 100.0 to calculate the IDF. Testing different log bases for the vector model weighting technique highlights the importance of understanding the performance of the system at different weighting values. We use the documents of the MED, CRAN, NPL, LISA, and CISI test collections, which scientists assembled explicitly for experiments in information retrieval systems.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.04)
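The abstract above defines TF-IDF and the IDF logarithm whose base the paper varies. Below is a minimal sketch of that weighting with the base exposed as a parameter; the toy corpus and the chosen bases are assumptions for illustration, not the paper's experimental setup.

```python
# Hedged sketch: TF-IDF with a configurable logarithm base for the IDF component.
import math
from collections import Counter

def tf_idf(term: str, doc: list[str], corpus: list[list[str]], base: float = 10.0) -> float:
    """TF * log_base(N / df), where N is the corpus size and df the document frequency."""
    tf = Counter(doc)[term]                                  # raw term frequency
    df = sum(1 for d in corpus if term in d)                 # document frequency
    idf = math.log(len(corpus) / df, base) if df else 0.0    # IDF with chosen log base
    return tf * idf

# Toy corpus of tokenised documents, used only to show how the base changes weights.
corpus = [["heart", "disease", "study"],
          ["retrieval", "study", "model"],
          ["heart", "rate", "model"]]

for base in (2, 10, 100):
    print(base, round(tf_idf("heart", corpus[0], corpus, base=base), 4))
```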
AToMiC: An Image/Text Retrieval Test Collection to Support Multimedia Content Creation
Yang, Jheng-Hong, Lassance, Carlos, de Rezende, Rafael Sampaio, Srinivasan, Krishna, Redi, Miriam, Clinchant, Stéphane, Lin, Jimmy
This paper presents the AToMiC (Authoring Tools for Multimedia Content) dataset, designed to advance research in image/text cross-modal retrieval. While vision-language pretrained transformers have led to significant improvements in retrieval effectiveness, existing research has relied on image-caption datasets that feature only simplistic image-text relationships and underspecified user models of retrieval tasks. To address the gap between these oversimplified settings and real-world applications for multimedia content creation, we introduce a new approach for building retrieval test collections. We leverage hierarchical structures and diverse domains of texts, styles, and types of images, as well as large-scale image-document associations embedded in Wikipedia. We formulate two tasks based on a realistic user model and validate our dataset through retrieval experiments using baseline models. AToMiC offers a testbed for scalable, diverse, and reproducible multimedia retrieval research. Finally, the dataset provides the basis for a dedicated track at the 2023 Text Retrieval Conference (TREC), and is publicly available at https://github.com/TREC-AToMiC/AToMiC.
- Europe > United Kingdom (0.14)
- North America > Canada (0.04)
- Europe > France (0.04)
- (2 more...)